We will work on a dataset that contains information about bookings from City hotel and Resort hotel. The dataset can be found (https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib6)
Meal: Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
In this project, we will analyze the dataset to find out the answers to the following questions:
We will start by importing some useful libraries we need in this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df=pd.read_csv('hotel_bookings.csv')
the dataset is loaded, now we will gather some basic information about thedataframe.
df.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 7/3/2015 |
5 rows × 32 columns
df['country'].unique()
array(['PRT', 'GBR', 'USA', 'ESP', 'IRL', 'FRA', nan, 'ROU', 'NOR', 'OMN',
'ARG', 'POL', 'DEU', 'BEL', 'CHE', 'CN', 'GRC', 'ITA', 'NLD',
'DNK', 'RUS', 'SWE', 'AUS', 'EST', 'CZE', 'BRA', 'FIN', 'MOZ',
'BWA', 'LUX', 'SVN', 'ALB', 'IND', 'CHN', 'MEX', 'MAR', 'UKR',
'SMR', 'LVA', 'PRI', 'SRB', 'CHL', 'AUT', 'BLR', 'LTU', 'TUR',
'ZAF', 'AGO', 'ISR', 'CYM', 'ZMB', 'CPV', 'ZWE', 'DZA', 'KOR',
'CRI', 'HUN', 'ARE', 'TUN', 'JAM', 'HRV', 'HKG', 'IRN', 'GEO',
'AND', 'GIB', 'URY', 'JEY', 'CAF', 'CYP', 'COL', 'GGY', 'KWT',
'NGA', 'MDV', 'VEN', 'SVK', 'FJI', 'KAZ', 'PAK', 'IDN', 'LBN',
'PHL', 'SEN', 'SYC', 'AZE', 'BHR', 'NZL', 'THA', 'DOM', 'MKD',
'MYS', 'ARM', 'JPN', 'LKA', 'CUB', 'CMR', 'BIH', 'MUS', 'COM',
'SUR', 'UGA', 'BGR', 'CIV', 'JOR', 'SYR', 'SGP', 'BDI', 'SAU',
'VNM', 'PLW', 'QAT', 'EGY', 'PER', 'MLT', 'MWI', 'ECU', 'MDG',
'ISL', 'UZB', 'NPL', 'BHS', 'MAC', 'TGO', 'TWN', 'DJI', 'STP',
'KNA', 'ETH', 'IRQ', 'HND', 'RWA', 'KHM', 'MCO', 'BGD', 'IMN',
'TJK', 'NIC', 'BEN', 'VGB', 'TZA', 'GAB', 'GHA', 'TMP', 'GLP',
'KEN', 'LIE', 'GNB', 'MNE', 'UMI', 'MYT', 'FRO', 'MMR', 'PAN',
'BFA', 'LBY', 'MLI', 'NAM', 'BOL', 'PRY', 'BRB', 'ABW', 'AIA',
'SLV', 'DMA', 'PYF', 'GUY', 'LCA', 'ATA', 'GTM', 'ASM', 'MRT',
'NCL', 'KIR', 'SDN', 'ATF', 'SLE', 'LAO'], dtype=object)
Country names are represented in the ISO format.
df.shape
(119390, 32)
df.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
'arrival_date_month', 'arrival_date_week_number',
'arrival_date_day_of_month', 'stays_in_weekend_nights',
'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
'country', 'market_segment', 'distribution_channel',
'is_repeated_guest', 'previous_cancellations',
'previous_bookings_not_canceled', 'reserved_room_type',
'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
'company', 'days_in_waiting_list', 'customer_type', 'adr',
'required_car_parking_spaces', 'total_of_special_requests',
'reservation_status', 'reservation_status_date'],
dtype='object')
df.isnull().values.any()
True
df.isnull().sum()
hotel 0 is_canceled 0 lead_time 0 arrival_date_year 0 arrival_date_month 0 arrival_date_week_number 0 arrival_date_day_of_month 0 stays_in_weekend_nights 0 stays_in_week_nights 0 adults 0 children 4 babies 0 meal 0 country 488 market_segment 0 distribution_channel 0 is_repeated_guest 0 previous_cancellations 0 previous_bookings_not_canceled 0 reserved_room_type 0 assigned_room_type 0 booking_changes 0 deposit_type 0 agent 16340 company 112593 days_in_waiting_list 0 customer_type 0 adr 0 required_car_parking_spaces 0 total_of_special_requests 0 reservation_status 0 reservation_status_date 0 dtype: int64
We can see that the dataset has some kind of missing values.
it is clear that column "country" has 488, "agent" 16340 and "company" with the highest at 112593. It means that the booking is likely done without any agent and company involved.
df['adults'].unique()
array([ 2, 1, 3, 4, 40, 26, 50, 27, 55, 0, 20, 6, 5, 10],
dtype=int64)
df['children'].unique()
array([ 0., 1., 2., 10., 3., nan])
df['babies'].unique()
array([ 0, 1, 2, 10, 9], dtype=int64)
It seems that here we have some dirtiness in data as "adults", "children", and "babies" cannot be zero at a same time.
filter=(df["adults"]==0) & (df["children"]==0) & (df["babies"]==0)
df[filter]
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2224 | Resort Hotel | 0 | 1 | 2015 | October | 41 | 6 | 0 | 3 | 0 | ... | No Deposit | NaN | 174.0 | 0 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 10/6/2015 |
| 2409 | Resort Hotel | 0 | 0 | 2015 | October | 42 | 12 | 0 | 0 | 0 | ... | No Deposit | NaN | 174.0 | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 10/12/2015 |
| 3181 | Resort Hotel | 0 | 36 | 2015 | November | 47 | 20 | 1 | 2 | 0 | ... | No Deposit | 38.0 | NaN | 0 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 11/23/2015 |
| 3684 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 1 | 4 | 0 | ... | No Deposit | 308.0 | NaN | 122 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 1/4/2016 |
| 3708 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 2 | 4 | 0 | ... | No Deposit | 308.0 | NaN | 122 | Transient-Party | 0.00 | 0 | 0 | Check-Out | 1/5/2016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115029 | City Hotel | 0 | 107 | 2017 | June | 26 | 27 | 0 | 3 | 0 | ... | No Deposit | 7.0 | NaN | 0 | Transient | 100.80 | 0 | 0 | Check-Out | 6/30/2017 |
| 115091 | City Hotel | 0 | 1 | 2017 | June | 26 | 30 | 0 | 1 | 0 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.00 | 1 | 1 | Check-Out | 7/1/2017 |
| 116251 | City Hotel | 0 | 44 | 2017 | July | 28 | 15 | 1 | 1 | 0 | ... | No Deposit | 425.0 | NaN | 0 | Transient | 73.80 | 0 | 0 | Check-Out | 7/17/2017 |
| 116534 | City Hotel | 0 | 2 | 2017 | July | 28 | 15 | 2 | 5 | 0 | ... | No Deposit | 9.0 | NaN | 0 | Transient-Party | 22.86 | 0 | 1 | Check-Out | 7/22/2017 |
| 117087 | City Hotel | 0 | 170 | 2017 | July | 30 | 27 | 0 | 2 | 0 | ... | No Deposit | 52.0 | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 7/29/2017 |
180 rows × 32 columns
df[~filter]
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 7/1/2015 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 7/1/2015 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 7/2/2015 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 7/2/2015 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 7/3/2015 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | No Deposit | 394.0 | NaN | 0 | Transient | 96.14 | 0 | 0 | Check-Out | 9/6/2017 |
| 119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 225.43 | 0 | 2 | Check-Out | 9/7/2017 |
| 119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 157.71 | 0 | 4 | Check-Out | 9/7/2017 |
| 119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 89.0 | NaN | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 9/7/2017 |
| 119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 151.20 | 0 | 2 | Check-Out | 9/7/2017 |
119210 rows × 32 columns
data=df[~filter]
data.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 7/1/2015 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 7/2/2015 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 7/3/2015 |
5 rows × 32 columns
Now the data is clean and ready for analysis.
df['hotel'].unique()
array(['Resort Hotel', 'City Hotel'], dtype=object)
Since we have two different hotels. We will separate them.
resort=data[(data['hotel']=="Resort Hotel") & (data["is_canceled"]==0)]
city=data[(data['hotel']=="City Hotel") & (data["is_canceled"]==0)]
resort.shape
(28927, 32)
city.shape
(46084, 32)
df = data.groupby('hotel')['hotel'].count()
sns.barplot(x=df.index, y=df)
<AxesSubplot:xlabel='hotel', ylabel='hotel'>
country_data=data[data['is_canceled']==0]['country'].value_counts().reset_index()
country_data.columns=['country','No of guests']
country_data
| country | No of guests | |
|---|---|---|
| 0 | PRT | 20977 |
| 1 | GBR | 9668 |
| 2 | FRA | 8468 |
| 3 | ESP | 6383 |
| 4 | DEU | 6067 |
| ... | ... | ... |
| 160 | BHR | 1 |
| 161 | DJI | 1 |
| 162 | MLI | 1 |
| 163 | NPL | 1 |
| 164 | FRO | 1 |
165 rows × 2 columns
px.choropleth(country_data, locations=country_data['country'],
color=country_data['No of guests'],
hover_name=country_data['country'],
title="Home country of guests")
data['meal'].unique()
array(['BB', 'FB', 'HB', 'SC', 'Undefined'], dtype=object)
data['meal'].value_counts()
BB 92236 HB 14458 SC 10549 Undefined 1169 FB 798 Name: meal, dtype: int64
data.groupby(['hotel','meal']).agg({'meal':'count'}).unstack()
| meal | |||||
|---|---|---|---|---|---|
| meal | BB | FB | HB | SC | Undefined |
| hotel | |||||
| City Hotel | 62233.0 | 44.0 | 6412.0 | 10474.0 | NaN |
| Resort Hotel | 30003.0 | 754.0 | 8046.0 | 75.0 | 1169.0 |
px.pie(data,names=data['meal'].value_counts().index,values=data['meal'].value_counts()
)
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
df = data.groupby(['hotel', 'arrival_date_month'])['adr'].mean().reset_index()
df['arrival_date_month'] = pd.Categorical(df['arrival_date_month'], categories=months, ordered=True)
df.sort_values('arrival_date_month', inplace=True)
df
| hotel | arrival_date_month | adr | |
|---|---|---|---|
| 4 | City Hotel | January | 82.754477 |
| 16 | Resort Hotel | January | 49.507033 |
| 3 | City Hotel | February | 85.327519 |
| 15 | Resort Hotel | February | 55.189716 |
| 7 | City Hotel | March | 92.973339 |
| 19 | Resort Hotel | March | 57.554652 |
| 0 | City Hotel | April | 111.397415 |
| 12 | Resort Hotel | April | 77.849496 |
| 20 | Resort Hotel | May | 78.758134 |
| 8 | City Hotel | May | 121.764614 |
| 6 | City Hotel | June | 119.186056 |
| 18 | Resort Hotel | June | 110.481032 |
| 5 | City Hotel | July | 110.945950 |
| 17 | Resort Hotel | July | 155.181299 |
| 1 | City Hotel | August | 114.857330 |
| 13 | Resort Hotel | August | 186.790574 |
| 11 | City Hotel | September | 110.120296 |
| 23 | Resort Hotel | September | 93.252030 |
| 22 | Resort Hotel | October | 62.132572 |
| 10 | City Hotel | October | 100.119313 |
| 9 | City Hotel | November | 88.372486 |
| 21 | Resort Hotel | November | 48.313643 |
| 2 | City Hotel | December | 89.209560 |
| 14 | Resort Hotel | December | 69.051887 |
plt.figure(figsize=(9,6))
sns.lineplot(x='arrival_date_month', y='adr', hue='hotel', data=df)
plt.ylabel("Average daily price")
plt.xlabel("Months")
p = plt.xticks(rotation=30)
plt.title("Average Daily rate by months")
Text(0.5, 1.0, 'Average Daily rate by months')
We can conclude that people from over the world say in these two hotels and most of guest are from Portugal and other European countries and 77.4% of bookings reserves are for Bed and Breakfast. We can see the mid-year months have higher daily prices as seen from the bookings for the high mid-year months.